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Abstract 

We compare two approaches to compute a fraction of the spectrum of dense sym- 
metric definite generalized eigenproblems: one is based on the reduction to tridi- 
agonal form, and the other on the Krylov-subspace iteration. Two large-scale 
applications, arising in molecular dynamics and material science, are employed 
to investigate the contributions of the application, architecture, and parallelism 
of the method to the performance of the solvers. The experimental results 
on a state-of-the-art 8-core platform, equipped with a graphics processing unit 
(GPU), reveal that in realistic applications, iterative Krylov-subspace methods 
can be a competitive approach also for the solution of dense problems. 



1. Introduction 

We consider the solution of the generalized eigenproblem 

AX = BXA, (1) 

where A, B £ M. nxn are given, A £ ~BL SXS is a diagonal matrix with the s 
sought-after eigenvalues, and the columns of A £ M. nxs contain the correspond- 



ing unknown eigenvectors 17 |. When the pair (A,B) consists of a symmetric 
and a symmetric positive definite matrix, Eq. (Q]) is normally referred to as 
a symmetric- definite generalized eigenproblem (GSYEIG). We are interested in 
large-scale GSYEIGs arising in the simulation of molecular dynamics [HJ and ab 
initio simulations of materials in these applications, A and B are symmetric 
and dense, B is positive definite (SPD), n O(10, 000) - O(100, 000), and only 
few eigenpairs (eigenvalues and associated eigenvectors) are required: s <C n. 
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For the solution of GSYEIGs with dense (A, B), there exist two numerically 
stable approaches: the "tridiagonal-reduction" and the " Krylov- sub space itera- 
tion" [17] . Both of them start by transforming — either explicitly or implicitly — 
the generalized problem ([1) into a standard one (STDEIG). Specifically, consider 
the Cholesky factorization of B given by 

B = U T U, (2) 



where U E R nx ™ is upper triangular [171 ]: then the GSYEIG can be transformed 
into the STDEIG 

GY = YA, = {U- T AU- 1 ){UX) = {UX)k, (3) 

where C € R" x ™ is symmetric, and Y € M. nxs contains the eigenvectors asso- 
ciated with this problem. While the eigenvalues of the GSYEIG ([1]) and the 
STDEIG @ are the same, the eigenvectors X of GSYEIG can be easily recov- 
ered from those of STDEIG, Y, by solving the upper triangular linear system 

X := U^Y. (4) 

After this preliminary transformation, the tridiagonal-reduction approach 
employs orthogonal transforms to reduce C to tridiagonal form, from which the 
eigenpairs can be computed. On the other hand, the Krylov-subspace approach 
operates with C (either directly or via the matrices A and U), iteratively approx- 
imating the largest (or smallest) eigenpairs of the system through matrix- vector 
multiplications. When dealing with dense coefficient matrices, both families of 
numerical methods exhibit a computational cost of 0(n 3 ) floating-point arith- 
metic operations (flops), due to the transformation to standard form and, in the 
tridiagonal-reduction approach, the reduction to tridiagonal form. Therefore, 
the solution of large-scale dense eigcnproblems, as those appearing in molecu- 
lar dynamics or ah initio simulations, clearly calls for the application of high- 
performance computing techniques on parallel architectures. 

Traditionally, the tridiagonal-reduction approach has been regarded as the 
method-of-choice for the solution of dense eigenvalue problems while the Krylov- 
subspace alternative was preferred for sparse matrices. However, as we will 
show in this paper, on parallel architectures, the Krylov-subspace method is 
a competitive option for the solution of dense eigenvalue problems, and the 
adoption of one method over the other should be based instead on a variety of 
factors, such as the number of required eigenpairs and the target architecture. 

The major contribution of this paper is an experimental study of these two 
classes of numerical eigensolvers, implemented using parallel linear algebra li- 
braries and kernels for current desktop platforms, for two large-scale appli- 
cations. Following the evolution of computer hardware, we include two dis- 
tinct architectures in the evaluation: A system equipped with (general-purpose) 
multi-core processors from Intel, and a hybrid computer that embeds multi- 
core processors with (one or more) NVIDIA "Fermi" GPUs (graphics processor 
units). For brevity, we will refer to both multi-core processors and GPUs as 
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multi-threaded architectures. The linear algebra libraries include well-known 
packages like LAPACK [| or BLAS, as well as alternatives for multi-threaded 
architectures like PLASMA, libflame or MAGMA HI HI HI] . 

The rest of the paper is structured as follows. In Section [5] we review the 
different eigcnsolvcrs that are considered in this work, offering a brief descrip- 
tion of the underlying numerical methods and their computational and storage 
costs. In Section [3] we describe the experimental setup: The two large-scale ap- 
plications leading to dense GSYEIGs, and the hybrid multi-core/GPU platform 
on which we conduct the experiments. In Section U] we revisit the numerical 
methods, now from the point of view of conventional software libraries (LA- 
PACK, BLAS, SBR, ARPACK) that can be employed to implement them, and 
evaluate these implementations on the target multi-core processor, via the two 
case studies. We then repeat the experimentation using more recent libraries, 
specifically designed to leverage task-parallelism and/or hardware accelerators 
like the GPUs in Section [5] A short discussion of concluding remarks as well as 
future work is provided in Section [5] 



2. Generalized Symmetric Definite Eigenvalue Solvers 

In this section we first review the initial transformation from GSYEIG to 
STDEIG, and the final back-transform. We then describe the two approaches for 
the solution of STDEIG, — tridiagonal-reduction and Krylov-subspace iteration — 
and a number of algorithmic variants. We will assume that, initially, the storage 
available to the methods consists of two nxn arrays (for the data matrices A and 
-B), and annxs array (for the requested s eigenvectors). Hereafter we neglect 
the space required to store the s sought-after eigenvalues as well as any other 
lower order terms in storage costs. Analogously, in the following we neglect the 
lower order terms in the expressions for computational costs. 

2.1. Transformation to and from STDEIG 

The initial factorization in @ requires n 3 /3 flops, independently of s, the 
number of eigenpairs requested. In practice, the triangular factor U overwrites 
the corresponding entries in the upper triangular part of B so that the demand 
for storage space does not increase. The algorithmic variants of the tridiagonal- 
reduction approach require the matrix C := U~ T AU" 1 to be cxplicitely built; 
the same is true for one of the variants of the Krylov-subspace approach. In all 
cases, the entries of A can be overwritten with the result C. The computational 
cost for this operation amounts to 2n 3 flops if C is computed by solving two 
triangular linear systems. By exploiting the symmetry of C, the cost can instead 
be reduced to n 3 flops; again, the cost is independent of s. Conversely, the final 
back-transform Q costs n 2 s flops, and this operation can be performed in-placc. 

2.2. Tridiagonal-reduction approach 

Once GSYEIG has been transformed to STDEIG, we consider two alterna- 
tive methods for reducing C to tridiagonal form. The first one performs the 
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reduction in a single step, while the second employs two (or possibly more) 
steps, reducing the full and dense C to a banded matrix, and from there to the 
required tridiagonal form. 



Variant TD.\ Tridiagonal-reduction with Direct tridiagonalization. Efficient al- 
gorithms for the solution of Eq. (|3|) usually consist of three stages. Matrix C 
is first reduced to symmetric tridiagonal form by a sequence of orthogonal sim- 
ilarity transforms: Q T CQ = T, where Q £ R nxra is the matrix obtained from 
the accumulation of the orthogonal transforms, and T £ R™ xri is the resulting 
tridiagonal matrix. In the second stage, a tridiagonal eigensolver, for instance 
the MR 3 algorithm [l4|, EJ, is employed to accurately compute the desired s 
eigenvalues of T and the associated eigenvectors. In the third and last stage, 
a back-transform yields the eigenvectors of C; specifically, if TZ = ZA, with 
Z £ M. nxs containing the eigenvectors of T, then Y := QZ. The first and last 
stages cost 4n 3 /3 and 2n 2 s flops, respectively The complexity of the second 
stage, when performed by the MR 3 algorithm, is 0(ns) flops for computing s 
cigenpairs. (Other alternatives for solving symmetric tridiagonal eigenproblcms, 
such as the QR algorithm, the Divide & Conquer method, etc. ljj require OJn 3 ) 
flops in the worst case, and are rarely competitive with the MR 3 algorithm |2l|.) 

In this three-stage method, the orthogonal matrix Q is never constructed; 
the corresponding information is implicitly stored in the form of Householder 
reflectors in the annihilated entries of C . Therefore, for this first variant of the 
tridiagonal-reduction approach, there is no significant increase of the memory 
demand. 



Variant TT:. Tridiagonal-reduction with Two-stage tridiagonalization. One ma- 
jor problem of variant td is that half of the computations required to reduce 
C to tridiagonal form are performed via Level 2 BLAS operations; these oper- 
ations are considerably less efficient than the Level 3 BLAS kernels, especially 
on current multi-threaded architectures. 

An alternative to obviate this problem is to perform the reduction in two 
steps, first transforming the matrix C from dense to band form (W £ R nxra , 
with bandwidth w) and then from band to tridiagonal form. Provided (32 < 
) w •< n, this allows casting most computations during the reduction process 
(QfCQi = W, Q2WQ2 = T) in terms of efficient Level 3 BLAS operations, at 
the expense of a higher computational cost. (The choice of the value 32 is based 
on experimental experience with this algorithm Q. In particular, increasing this 
blocking factor permits a better reuse of cached data; however, it rapidly raises 
the computational cost of the subsequent reduction from band to tridiagonal 
form. Therefore, a balance between these two factors is needed.) In particular, 
reducing the matrix to tridiagonal by such a two-stage method basically requires 
4n 3 /3 flops to obtain W, and a lower-order amount for refining that into T. 
However, due to this double-step, recovering Y from the eigenvectors of T, as 
Y := Q1Q2Z, adds 7n 3 /3 + 2n 2 s flops to the method, and thus yields a much 
higher cost than for the previous alternative. Specifically, the full n x n matrix 
Ql needs to be explicitly constructed, which requires 4n 3 /3 flops. Then, one 
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needs to accumulate Q1Q2, for n 3 flops, and finally calculate {QxQ-^Z, for an 
additional 2n 2 s flops. 

In practice, the accumulation of Q\ is done by multiplying the identity ma- 
trix from the right with a sequence of the orthogonal transforms that are re- 
quired to reduce panels (column blocks) of the input matrix to banded form. 
Therefore, these accumulations can be completely performed via Level 3 BLAS 
operations (explicitly, via two calls to the matrix-matrix product per panel as 
the orthogonal transforms are applied by means of the WY representation) . On 
the other hand, during the reduction from band to tridiagonal form, the ma- 
trix Q2 is not explicitly constructed, but accumulated from the right into the 
previously constructed Q\. Although the reduction itself does not proceed by 
blocks, the accumulation of the orthogonal transforms are delayed to introduce 
blocked operations for the update. Therefore, the construction of Q1Q2 is fully 
cast in terms of Level 3 BLAS operations. 

The banded matrix W can be saved in compact form overwriting n x w 
entries of A. Unfortunately, in this approach we need to explicitly build the full 
matrix Q\. which requires space for an additional n x n array. 

2.3. Krylov- sub space approach 

Instead of reducing matrix C to tridiagonal form, one can employ (a variant 
of) the Lanczos procedure (l7j to iteratively construct an orthogonal basis of the 
Krylov subspace associated with C . At each iteration, by using a recursive three 
term relation, the procedure calculates a tridiagonal matrix T m of dimension 
m x to, with 2s < m, <C n, whose extremal eigenvalues approximate those of C, 
and a matrix V m of dimension n x m with the corresponding Krylov vectors. If 
the eigenvalues of T m accurately approximate the s sought-after eigenvalues of 
C, the iteration is stopped. Otherwise, the best s approximations are used to 
restart the Lanczos procedure Under certain conditions, and especially for 
symmetric matrices, the process often exhibits fast convergence. 

Despite the simplicity of the Lanczos procedure, due to floating point arith- 
metic, the orthogonality between the column vectors of V m is rapidly lost. As 
a consequence, once an eigenvalue has been found, the algorithm might fail to 
"remember" it, thus creating multiple copies. A simple method to overcome 
this issue is to perform the orthogonalization twice, as suggested by Kahan in 
his unpublished work and later demonstrated by formal analysis )l6l |. Alterna- 
tively, orthogonality can be monitored during the construction of the subspace, 
"ammending" it in case it is lost. Re-orthogonalizing Lanczos vectors once adds 
a variable computational cost to the algorithm, which can be up to 0(mn) in 
the worst scenario. The cost of obtaining the eigenpairs from T mi V m is 0(m 2 ) 
flops. Moreover, the dimension of the auxiliary storage space required in these 
methods is of the order of n x m or smaller. 

Variant KE.\ Krylov- subspace with Explicit construction of C. Like in the two 
variants of the previous approach, variant ke explicitly builds the matrix C, as 
illustrated in Eqn ([3]). Each iteration k of the Krylov subspace method then 
performs a (symmetric) matrix- vector product of the form Zk+i '■= Cwk, with 
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Zfc=i,2,... € R™, Wfc = o,i,... € E n , and Wq an initial guess, requiring 2n 2 flops per 
product. While obtaining Wk+i from z^ + i only requires a few operations of 
linear cost in n, the re-orthogonalization (in case it is needed) has an cost that 
varies between 0(n) and 0(mn) flops (best and worst case) and, potentially, can 
contribute substantially to the total computational time already for moderate 
values of m. In addition, a (data-dependent) number of implicit restarts are 
needed after the Lanczos augmentation step, each involving the application of 
the QR iteration to the tridiagonal matrix T m , thus resulting in a cost of 0{nm 2 ) 
flops per restart. 

Variant KI.\ Krylov-subspace with Implicit operation on C . In this variant, the 
matrix C is not formed. Instead, at each iteration of the iterative method, the 
calculation z^+i := U~ T AU^ 1 Wk is performed as a triangular system solve, 
followed by a matrix-vector product and, finally, a second triangular system 
solve: Zk+i ■= U~ T (A(U~ 1 Wk))- In this variant, there is no initial cost to pay 
for the explicit construction of C, but the cost per iteration for the computation 
of Zk+i doubles with respect to the previous case, from In 2 to An 2 flops. In the 
iteration, obtaining Wk+i from Zk+i requires 0(n) flops, and the aforementioned 
re-orthogonalization costs 0{nm) flops; in addition, each of the restarting steps 
performs 0(nm 2 ) flops. 



3. Experimental Setup 

In this section we briefly introduce the two benchmark applications that 
require the solution of dense GSYEIG and the platform on which we carried 
out the numerical experiments. 



3.1. Molecular Dynamics 

The first application-generated GSYEIG appears in molecular simulations 
of biological systems using normal mode analysis (NMA) in internal coordi- 
nates. Normal mode analysis (NMA) merged with coarse-grained models (CG) 
has proven to be a powerful and popular alternative of standard molecular dy- 
namics to simulate large collective motions of macromolccular complexes at 



extended time scales |2a, |24[ . In the approach, biomolecule atomic degrees 



of freedom are treated explicitly in solving the generalized eigenvalue problem 
in a biologically relevant conformation. The computed eigenvalues, also known 
as modes, form an orthonormal basis of displacements, i.e. any biomolecule 
conformational change can be expressed as a linear combination of the modes. 
Furthermore, excellent correlation has been found between the motion char- 
acterized only by the low frequency modes and the experimentally observed 
functional motions of large macromolecules. In the approach leading to the 
data matrices for this example, a recent implementation delivers the NMA low 
frequency modes by using dihedral angles as variables and employing different 
multi-scale CG representations [HJ]. This very efficient tool has been applied 
successfully to predict large-scale motions enzymes, viruses, and large protein 
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assemblies from a single conformation 18(. As illustrative case, in this case 



we used the biological relevant low frequency modes using default parameters. 
This system comprises n=9,997 internal coordinates to be solved in a general- 
ized eigenproblcm with both A and B SPD matrices. For the characterization 
of the collective motion, only about 1% of the smallest eigenpairs are needed. 
In order to accelerate the convergence of the Lanczos iteration of the Krylov- 
subspacc approach, in the experiments we compute the largest eigenpairs of the 
inverse problem BX = AXA^ 1 . 

3.2. Density Functional Theory simulations 

The second eigenproblem appears within an ah initio simulation arising in 
Density Functional Theory (DFT), one of the most effective frameworks for 
studying complex quantum mechanical systems at the core of materials science. 
DFT provides the means to solve a high-dimensional quantum mechanical prob- 
lem by transforming it into a large set of coupled one-dimensional equations, 
which is ultimately represented as a non-linear generalized eigenvalue problem. 
The later is solved sclf-consistently through a series of successive iteration cy- 
cles: the solution computed at the end of one cycle is used to generate the input 
in the next until the distance between two successive solutions is negligible. 

Typically a simulations requires tens of cycles before reaching convergence. 
After the discretization - intended in the general sense of reducing a continu- 
ous problem to one with a finite number of unknowns - each cycle comprises 
dozens of large and dense GSYEIGs : x ~ XB^'x where A is Hermi- 
tian and B Hcrmitian positive definite. Within every cycle, the eigenproblems 
are parametrized by the reciprocal lattice vector k, while the index % denotes 
the iteration cycle. The size of each problem ranges from 10,000 to 40,000 and 
the interest lies in the eigenpairs corresponding to the lower 10-20% part of the 
spectrum; the solution of such eigenproblems is one of the most time-consuming 
stages in the entire simulation. 

The problem solved in this paper comes from the simulation of the multi- 
layer material GeSb2Te4, one of the phase-changing materials used in rewritable 
optical discs (CDs, DVDs, Blu-Rays discs) and prototype non- volatile memories. 
The matrices (carrying indices i = 10 and k = 1) were originated with the 
FLEUR code [9[ at the Supercomputing Center of the Forschungszentrum Jiilich. 
The size of the eigenproblem is n =17,243 and the number of eigenpairs searched 
for is s =448, corresponding to the lowest 2.6% of the spectrum. 

3. 3. Target platform 

The experiments were carried out using double-precision arithmetic on a 
platform equipped with two Intel Xeon Quadcore processors E5520 (8 cores at 
2.27 GHz), with 24 GBytcs of memory, connected to an NVIDIA Tesla C2050 
(Fermi) GPU (480 cores at 1.15 GHz) with 3 GBytes of on-device memory. The 
operating system is the 64-bit CentOS 5.4. The following software libraries were 
employed: ARPACK 1.4.1, CUBLAS 4.0, CUDA driver 4.0, Intel MKL 10.3, 
GotoBLAS2 1.11, libf lame 5.0, MAGMA 1.0 RC5, PLASMA 2.4.2, and SBR 
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1.4.1. Codes were compiled using gcc 4.1.2 and/or gf ortran 4.1.2 with the -03 
optimization flag. 

A large effort was made to optimize parameters like the block size of the 
various routines, the bandwidth for variant tt, the number of Krylov vectors 
(m) for ke and ki, etc. For the Krylov-subspace methods, the stopping threshold 
of routine dsaupd was set to the default (tol=0). Internally, the code accepts 
the computed cigenpairs if the estimated relative residuals arc below the machine 
precision. 

4. Conventional Libraries for Multi-core Processors 

4-1- Exploiting multi-threaded implementations of BLAS 

In the dense linear algebra domain, the traditional approach to exploit the 
concurrency of a platform equipped with multiple processors (or cores) relies 
on the usage of highly-tuned, multi-threaded implementations of BLAS, often 
provided by the hardware vendors (Intel's MKL, AMD's ACML, IBM's ESSL, 
etc.) or by independent developers (e.g., GotoBLAS2). During the past decade, 
this was successfully leveraged by LAPACK Q as well as libf lame (2tJ to yield 
acceptable speed-ups with no effort on the programmer's side. 

The combination of LAPACK and BLAS provides most of the functionalities 
required to construct all the four algorithms (td, tt, ke, ki) to solve GSYEIGs 
on multi-core processors; see Table [T] Although LAPACK provides a specific 
routine to construct C := U~ T AU" 1 (dsygst), in our tests we found that 
computing C via two triangular system solves (dtrsm) was faster; therefore 
this is the option selected in our implementations. 

The missing components are provided by the SBR (Successive Band Reduc- 
tion) toolbox [8| and the ARPACK library [3[ . The former contains software for 
reducing a full/band matrix to band/tridiagonal form via orthogonal similarity 
transformations (variant tt) , while the later implements an implicitly restarted 
version of the Lanczos iteration (to obtain Wk+i from Zk+i', see variants KE 
and Ki in subsection 12. 3|) . In the SBR toolbox, parallelism can be obtained 
using a multi-threaded BLAS. Conversely, the benefits of a parallel execution 
of ARPACK — which mostly performs Level 1 and 2 BLAS operations — will 
not be as significant. In principle, the computation of the eigenvalues of the 
tridiagonal matrix T m and the eigenvectors from the Krylov vectors in V m add 
a minor cost to the overall computation due to the reduced value of m compared 
with the dimension of the problem. ARPACK employs a modified version of 
the symmetric iterative QR algorithm for this purpose 

4-2. Experimental evaluation 

Table[2]reports the execution time of the four eigensolvers td, tt, ke, and ki 
for the solution of both MD's and DFT's GSYEIG on the multi-core platform. 
The solvers are implemented using routines from the conventional software li- 
braries listed above, and compute 100 («1%) and 448 («2.6%) eigenpairs for 



Stage 


Appr. 


Var. 


Operation 


Routine 


Library 


(!) 






GSl 


B = U T U -> U 


DPOTRF 


LAPACK 






GS2 


C:=U- T AXJ- X 


dsygst/dtrsm 


LAPACK/BLAS 








TDl 


Q T CQ =T 


DSYTRD 


LAPACK 






TD 


td2 


TZ = ZA^T,Z 


DSTEMR. 


LAPACK 




Trid. 
Reduct. 




td3 


Y :=QZ 


DOR.MTR 


LAPACK 






TTl 


QfCQi =W 


DSYRDB 


SBR 






TT 


tt2 


QlWQi = T 


DSBRDT 


SBR 






tt3 


TZ = ZA^T,Z 


DSTEMR 


LAPACK 


(2) 






tt4 


Y := Q!Q 2 Z 


DORMTR 


LAPACK 








KEl 


Zfc+l := Cw k 


DSYMV 


BLAS 






KE 


ke2 


Zk+l — > Wfe+1 


DSAUPD 


ARPACK 








ke3 


T m ,V m -> A,Y 


DSEUPD 


ARPACK 




Krylov 




Kll 


w k := U" 1 w k 


DTRSV 


BLAS 




Subsp. 




KI2 


w k := Aw k 


DSYMV 


BLAS 






KI 


KI3 


z k+1 := U~ T w k 


DTRSV 


BLAS 








Kl4 


Zfe+l — > Wh+l 


DSAUPD 


ARPACK 








Kl5 


T m ,V m ^K,Y 


DSEUPD 


ARPACK 


(3) 






BTl 


X:= U- V Y 


DTRSM 


BLAS 



(1): Reduction to standard, GS. (2): Standard Eigenvalue Problem. (3): Back-transform, BT. 



Table 1: Routines from conventional libraries necessary to build the GSYEIG solvers for 
multi-core processors. 



Key 


Experiment 1 (MD), s= 


100 


Experiment 2 (DFT), s 


=448 


TD 


TT 


KE 


KI 


TD 


TT 


KE 


KI 


GSl 


6.60 


6.60 


6.60 


6.60 


36.42 


36.42 


36.42 


36.42 


GS2 


27.54 


27.54 


27.54 




140.35 


140.35 


140.35 




TDl 


67.39 








342.01 








TD2 


0.54 








4.57 








td3 


0.86 








7.81 








TTl 




54.47 








272.86 






tt2 




93.16 








375.67 






tt3 




0.54 








4.57 






tt4 




0.46 








4.53 






KEl 






4.72 








200.65 




ke2 






0.53 








107.44 




ke3 






0.18 








13.38 




Kll 








13.92 








645.93 


KI2 








4.72 








214.07 


KI3 








13.56 








618.37 


Kl4 








0.54 








118.29 


Kl5 








0.18 








13.74 


BTl 


0.31 


0.31 


0.31 


0.31 


2.41 


2.41 


2.41 


2.41 


Tot. 


103.24 


183.08 


39.88 


39.83 


533.57 


836.81 


500.65 


1,649.23 



Table 2: Execution time (in seconds) of the GSYEIG solvers on multi-core processors. 
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the MD and DFT experiments, respectively. These values reflect the needs of 
the associated application. 

In Experiment 1, the execution time for the two variants of the Krylov- 
subspacc approach is approximately the same. The number of ARPACK iter- 
ations that the Krylov-based variants require for this particular eigcnproblem, 
(288 for both ke and ki,) basically balance the higher cost per iteration of ki 
(13.92+13.56=27.48 seconds to compute the two triangular solves, in Kil and 
KI3) with that of building explictly C in ke (27.54 seconds due to GS2). The low 
performance of both tridiagonal-reduction variants (td and tt) can be credited 
to the cost of the reduction to tridiagonal form. Theoretically, this operation 
is not much more expensive than the transformation to standard form (e.g., 
4n 3 /3 flops for TD versus n 3 /3 flops for the Cholesky factorization plus n 3 ad- 
ditional flops for the construction of C). Nevertheless, the fact that half of the 
flops performed in the reduction to tridiagonal form via a direct method (vari- 
ant td) are cast in terms of BLAS-2, explains the high execution time of this 
operation on a multi-core processor. Avoiding this type of low-performance op- 
erations is precisely the purpose of variant tt but, at least for this experiment, 
the introduction of a large overhead in terms of additional number of flops (in 
the accumulation of Q\Qi during the reduction from band to tridiagonal form) 
destroys the benefits of using BLAS-3. Finally, it is worth mentioning that 
the execution time of the tridiagonal eigensolver (operations TD2 and TT2) is 
negligible, validating the choice of MR 3 for this step. 

The situation varies in Experiment 2. Now ke is the fastest variant, fol- 
lowed closely by the tridiagonal-reduction td. The reason lies in the number of 
ARPACK iterations that the Krylov-based variants requires for this eigenprob- 
lem (now quite high, 4,034 for ke and 4,261 for ki), which increases considerably 
the overall cost of the iterative stage, especially for ki. Variant tt is not compet- 
itive, mainly due to the cost of the accumulation of orthogonal transformations 
during the reduction of the band matrix to tridiagonal form (operation tt2). 
For this particular problem and value of s, the MR 3 algorithm applied to the 
tridiagonal eigcnproblem adds only a minor cost to the execution time. 

Table shows the accuracy of the solutions (in terms of relative residual 
and orthogonality) obtained with the four eigensolvers. In Experiment 1, our 
algorithms are applied to the inverse eigenpair (A, B) = (B,A), as computing 
its s largest eigenpairs yields faster convergence in this case; in Experiment 2, 
(A, B) = (A, B). The results show that the accuracy of td and ke are compa- 
rable but there exists a slight degradation of variant ki, which may be due to 
the operation with the upper triangular factor U at each iteration. 

To close our evaluation of the implementations based on conventional li- 
braries, Figure [T] reports the execution times of variants td, ke, and ki for 
different values of s. (Variant tt is not included because the previous experi- 
ments clearly demonstrated that it was not competitive.) The results show a 
rapid increase in the execution time of the variants based on the Krylov-subspace 
as s grows, due to a significant increase in the number of steps that these iter- 
ative procedures require as well as the increment in the costs associated with 
re-orthogonalization and restart, which respectively grow quadratically and lin- 
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Experiment 1 (MD), s =100 


TD 


TT 


KE 


KI 


||/-X T BX|| F 

II Sill* 
||AX-BXA|| F 

max( \\A\\ j?, || JB ||p) 


6.68E-21 
1.03E-16 


6.56E-21 
1.03E-16 


5.58E-21 
1.05E-16 


6.73E-21 
3.80E-16 






Experiment 2 (DFT), s =448 


TD 


TT 


KE 


KI 


|/-X T BX|| F 

II-BUf 
||AX-BXA|| F 
max( | A | F , | B|| F ) 


1.15E-15 
9.80E-16 


2.29E-14 
1.93E-15 


1.35E-15 
6.45E-16 


1.43E-15 
1.93E-14 



Table 3: Accuracy of the GSYEIG solvers built from conventional libraries. 
Experiment 1 (MD) Ex P erimenl 2 ( DFT ) 




Figure 1: Execution time of the GSYEIG solvers on multi-core processors for different values 
of the number of computed eigenpairs s. 

early with m (with m > 2s). This is particularly penalising for variant KI, due 
to its higher cost per iteration. The small increase in the execution time of 
variant TD, on the other hand, is mostly due to the back-transform. 



5. Libraries for Multi-threaded Architectures 

5.1. Task-parallel libraries for multi-core processors 

With the emergence of multi-core processors, and especially with the in- 
crease in the number of processing elements in these architectures, exploit- 
ing task-level parallelism has been recently reported as a successful path to 
improve the performance of both dense linear algebra operations (3. 1 1 ClL l23j 
and sparse linear system solvers [l[; moreover, projects like Cilk, SMPSs, and 
StarPU have demonstrated the assets of leveraging task-based parallelism in 
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more general computations. Modern dense linear algebra libraries that adhere 
to the task-parallel approach include PLASMA and libf lame+SuperMatrix 
(hereafter If +SM). Unfortunately, the current releases of PLASMA (2.4.2) and 
If +SM (5.0) provide only a reduced number of kernels, and for the generalized 
eigenvalue problem, they only cover the initial reduction to STDEIG. Con- 
cretely, lf+SM provides routines FLA_Chol and FLA_Sygst for operations 
GSl and GS2, while PLASMA implements only routine PLASMA_dpotrf for 
the first operation. The performance of these kernels are compared with those of 
LAPACK/BLAS in Table |U The results there show that the use of these task- 
parallel libraries especially benefits those variants which explictly construct C 
(td, tt and ke). In particular, if we consider the effect of the reduction of the 
execution time of GS2 using lf+SM, ke becomes clearly faster than ki in Ex- 
periment 1. In the other experiment, the situation does not vary, as the faster 
solvers were td and ke, and they equally benefit from any improvement to the 
construction of C . 



Key 


Example 1 (MD), s 


=100 


Example 2 (DFT), s 


=448 


LAPACK/BLAS 


lf+SM 


PLASMA 


LAPACK/BLAS 


lf+SM 


PLASMA 


GSl 


6.60 


5.63 


5.13 


36.42 


25.19 


27.97 


GS2 


27.54 


14.18 




140.35 


83.34 





Table 4: Execution time (in seconds) of the task-parallel eigensolvers on multi-core processors. 



5.2. Kernels for CPUs 

The introduction of CPUs with unified architecture and programming style [2C 
posed quite a revolution for the scientific and high-performance community. 
Linear algebra was not an exception, and individual efforts 0, [26| were soon 



followed by projects (e.g., lf+SM, MAGMA, CULA [12j) conducted to improve 
and extend the limited functionality (and sometimes performance) of the im- 
plementation of the BLAS from NVIDIA (CUBLAS). 

As of today, the development of dense linear algebra libraries for GPUs is 
still immature, but certain kernels exist and have demonstrated performance 
worth of being investigated. Table [5] contains a list of GPU kernels related 
to the solution of GSYEIGs. The MAGMA and CUBLAS libraries provide 
routines for the Cholesky factorization, the tridiagonalization, as well as several 
Level 2 and 3 BLAS operations. The reduction from GSYEIG to STDEIG 
is implemented in lf+SM, while routines for the two-stage tridiagonalization 
(SBRG) were developed as part of previous work @, 

5.3. Experimental evaluation of prototype libraries 

Table [6] reports the execution time of the four eigensolvers employing the 
kernels specified in Table [5] Those operations for which no GPU kernel was 
available were computed on the CPU and the corresponding timings are marked 
in bold face in the table. In this case, the time required to transfer the data 
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Stage 


Appr. 


Var. 


Operation 


Routine(s) 


Library 


(1) 






GS1 B= U T U ->■ U 
GS3 C := U^AU' 1 


MAGMA_DPOTRF or 
FLA.OHOL 

FLA_SYGST 

cublasDtrsm or 

MAGMA.DTRSM 


MAGMA or 
lf+SM 
lf+SM 

CUBLAS or 
MAGMA 


(2) 


Trid. 
Reduct. 


TD 


TD1 Q T CQ =T 
TD2 TZ = ZA -> T, Z 
TD3 Y := 


MAGMA_DSYTRD 


MAGMA 


TT 


TTl Q\CQx =w 
TT2 Q^WQ 2 = T 
TT3 TZ = ZA -> T, Z 
TT4 Y := QiQ 2 Z 


GPU_DSYRDB 
GPU-DSBRDT 


SBRG 
SBRG 


Krylov 
Subsp. 


KE 


KEl z fc+ i := Cwfc 

KE2 Zk + i Wk+i 

KE3 T m ,V m ->A,Y 


CUBLASDSYMV Or 
MAGMA_DSYMV 


CUBLAS or 
MAGMA 


KI 


KIl to*. := f/ -1 w k 

KI2 := Aw>fc 

KI3 z k+1 := U' T w k 
KI4 z k +i — > w k +i 

ki5 r m ,y m ->A,y 


CUBLAsDtrsv or 

MAGMA.DTRSV 
CUBLASDSYMV Or 
MAGMA_DSYMV 

CUBLAsDtrsv or 

MAGMA.DTRSV 


CUBLAS or 
MAGMA 

CUBLAS or 
MAGMA 

CUBLAS or 
MAGMA 


(3) 






bti x:= f/ _1 Y 


cublasDtrsm or 

MAGMA.DTRSM 


CUBLAS or 
MAGMA 



(1): Reduction to standard, GS. (2): Standard Eigenvalue Problem. (3): Back-transform, BT. 



Table 5: Routines from modern libraries necessary to build the GSYEIG solvers for multi- 
threaded architectures. 



13 



Key 


Experiment 1 (MD), s =100 


Experiment 2 (DFT), s =448 


TD 


TT 


KE 


KI 


TD 


TT 


KE 


KI 


GS1 
GS2 


1.52 
7.38 


1.52 
7.38 


1.52 
7.38 


1.52 


7.12 
44.17 


7.12 
44.17 


7.12 
44.17 


7.12 


TDl 
TD2 
TD3 


59.08 
0.54 
0.86 








297.84 
4.57 
7.81 








TTl 
TT2 
TT3 
TT4 




31.60 
47.70 
0.54 
0.46 








152.37 
92.18 
4.57 
4.53 






KEl 
KE2 
KE3 






1.79 
0.46 
0.18 








75.31 
123.97 
13.17 




KIl 
KI2 
KI3 
KI4 
KI5 








10.64 
1.79 
11.06 
0.54 
0.18 








296.73 
210.77 
310.47 
121.89 
13.17 


BTl 


0.05 


0.05 


0.05 


0.05 


0.84 


0.84 


0.84 


0.84 


Tot. 


69.43 


89.25 


11.38 


25.78 


362.35 


305.76 


264.58 


970.12 



Table 6: Execution time (in seconds) of the conventional+modcrn GSYEIG solvers on multi- 
threaded architectures. The numbers in boldface are obtained using the LAPACK in place of 
the missing GPU routines. 

between the main memory and the hardware accelerator's memory space is in- 
cluded in the result. (Because of the transfer cost, the timings for the operations 
performed on the CPU are not the same as those reported in Tabic [2|). 

Whenever a GPU kernel was provided by more than a library (e.g., rou- 
tines fla_SYGST from If +SM, CUBLAsDtrsm from CUBLAS or magma_DTRSM 
from MAGMA) we selected the one included in MAGMA. The timings using 
kernels from If +SM for operation GSl were slightly worse than those obtained 
with MAGMA for both Experiments 1 and 2. On the other hand, If +SM out- 
performed the kernel in MAGMA for GS2 in Experiment 2 but was inferior in 
Experiment 1. CUBLAS offered similar or worse performance in all these cases. 

The first observation to make is the remarkable difference between the exe- 
cution time of variant KE when the GPU is employed to accelerate operation GS2 
in Experiment 1: from 27.54 to only 7.28 seconds (a speed-up of 3.73) using, in 
this case, two calls to the triangular system solve from MAGMA. This is com- 
plemented by the lowering of the timing for operation GSl using the Cholesky 
factorization from MAGMA; for this operation, GPUs attain an even higher 
speed-up, 4.34, but on a less dominant stage. Combined, the two stages lead to 
an overall 3.5 x acceleration factor of variant KE, which is now the best method 
for this experiment. While other variants also have to compute the same oper- 
ations, the acceleration reported by the GPU is blurred by the minor cost of GS 
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Experiment 1 (MD), s =100 


TD 


TT 


KE 


KI 


\\I-X T BX 

\\B\\f 
\\AX-BXA 


F 
F 


4.02E-20 
5.41E-16 


4.01E-20 
5.42E-16 


4.04E-20 
5.41E-16 


4.03E-20 
5.72E-16 


max( || A\\ p ,|| h 


\\f) 






Experiment 2 (DFT), s =448 


TD 


TT 


KE 


KI 


\\I~X T BX 


F 


1.61E-14 
5.41E-15 


3.68E-14 
1.56E-15 


1.42E-15 
7.46E-16 


1.38E-15 
5.33E-14 


\\B\\f 
\\AX-BXA 


F 


max( | j4 1| f , || h 


II p) 



Table 7: Accuracy of the convcntional+modern GSYEIG solvers. 

compared with other operations. Experiment 2 also benefits from the use of the 
GPU during the initial transformation from GSYEIG to STDEIG. However, for 
the large experiment and variant KI, we cannot use the matrix-vector products 
in this platform. The matrices involved in this experiment are too large to keep 
two n x n arrays into the GPU memory, one for the triangular factor U and one 
for A. 

While the GPU promises important gains when applied to perform CPU- 
bound computations (with intensive data parallelism), in some cases the results 
are somewhat disappointing. This is the case, for example, of the reduction 
to tridiagonal form in variant TD, which applies the routine DSYTRD (from 
MAGMA library) that shows much slower speed-up on the GPU than expected. 
On the other hand, our GPU implementation of variant TT attains much better 
performance than TD, although it is still slower than the Krylov-based approach 
ke. Besides, the general evaluation is that in some cases the GPU routines in 
these libraries are not directly applicable as, e.g., happens when the data ma- 
trices are too large to fit into the device memory, which in general is much 
smaller than the main memory. This requires a certain knowledge of the nu- 
merical operation, to transform it into a sort of out-of-core routine. While such 
a restructuring is easy for some operations like the triangular system solve with 
multiple right-hand sides, dealing with others like the reduction to band form 
turns out to be quite a complex task (l3j . 

In Table [7] we report the accuracy of the GSYEIG eigensolvers built on 
top of the conventional-l- modern libraries. In Experiment 1, all methods yield 
similar results while, in Experiment 2, the iterative solvers present slightly better 
accuracies. On the other hand, in general there are little qualitative differences 
between the results obtained with conventional and the convcntional+modern 
libraries. 

Figure[2]rc-evaluates the performance of variants TD, ke, and KI as a function 
of s, now leveraging the implementation of these solvers using the kernels in 
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Experiment 1 (MD) Experiment 2 ( DFT > 




Figure 2: Execution time of the conventional+modern GSYEIG solvers for different values of 
the number of computed eigcnpairs s. 

the conventional+modern libraries. The results exhibit a rapid increase in the 
execution time of the variants based on the Krylov-subspace with the dimension 
of s, due to the growth in the number of iterative steps, especially for ki. 

6. Conclusions 

We presented a performance study for the solution of generalized eigen- 
problems on multi-threaded architectures. The focus was on two different ap- 
proaches: the reduction to tridiagonal form — either directly or in successive 
steps — , and the iterative solution through a Krylov method. In both cases, 
we first built the eigensolvers on top of conventional numerical libraries (BLAS, 
LAPACK, SBR, and ARPACK), and then compared with implementations that 
make use of modern multi-threaded libraries (libf lame, PLASMA, MAGMA, 
and CUBLAS) as well as a few GPU kernels that we developed ourselves. As 
testbeds, we chose matrices arising in large-scale molecular dynamics and den- 
sity functional theory; in both applications, only a portion of the lower part of 
the spectrum is of interest. The results are representative of the benefits that 
one should expect from GPUs and multi-threaded libraries; moreover, they indi- 
cate that in realistic applications, when only 3-5% of the spectrum is required, 
the Krylov-subspace solver is to be preferred. 
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